class: center, middle, inverse, title-slide

# Decision Trees and Random Forests
## .big[🤔 🌴 🍂️ ]

### Applied Machine Learning in R
### Pittsburgh Summer Methodology Series
### Lecture 3-B
### July 21, 2021

---
class: inverse, center, middle

# Overview

<style type="text/css">
.onecol {
  font-size: 26px;
}
.twocol {
  font-size: 24px;
}
.remark-code {
  border: 1px solid grey;
}
a {
  background-color: lightblue;
}
.remark-inline-code {
  background-color: white;
}
</style>

---
## Lecture Topics

.pull-left[
**Simple Decision Trees**
- Motivation (modeling nonlinearity)
- Classification trees
- Regression trees
- Recursive partitioning
- Pruning
- Stopping criteria (e.g., maximum depth)

**Ensemble Methods**
- Random forests
- Aggregating predictions from many models
- Improved prediction accuracy at the cost of lower interpretability
]

.pull-right[
<img src="data:image/png;base64,#flowchart2.jpg" width="400" height="450" />
]

---
## Geometry of Data

Thus far, we have only been modeling linear relationships with linear boundaries between classes, e.g.:

<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-2-1.png" width="100%" />

---
## Geometry of Data

But what about other types of relationships?

<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-3-1.png" width="100%" />

---
## Geometry of Data

But what about other types of relationships?

<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-4-1.png" width="100%" />

---
## Geometry of Data

Although these classes are very clearly separated, it's no longer easy to use a **single equation** to describe the boundaries between them.

.pull-left[
<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-5-1.png" width="100%" />
]

.pull-right[
<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-6-1.png" width="100%" />
]

--

**Decision trees** allow us to build models that are capable of describing these complex decision boundaries while maintaining ease of interpretability.
---
class: inverse, center, middle

# Decision Trees

---
## Decision Trees

Tree-based models use the logic of `if-then` statements in order to partition the data into smaller, more homogeneous subgroups. For instance:

--

`if has legs` </br>
`| if barks then animal = dog` </br>
`| else animal = cat` </br>
`else animal = fish`

--

This is an example of a simple **classification tree**. Decision trees can also be used for regression problems (i.e., **regression trees**).

A key benefit of decision trees is that they are **highly interpretable** and able to model nonlinear relationships. However, they typically provide **lower predictive accuracy** than other supervised learning methods (e.g., random forests, which we will talk about later today).

---
## Decision Trees

.pull-left[
Decision trees are often visualized **graphically**.

The top node is called the **root node**. Subsequent splitting nodes are called **internal nodes** or **branches**. The output labels are called **terminal nodes** or **leaves**.

If a statement is **true**, you go to the left. If a statement is **false**, you go to the right.
]

.pull-right[
<img src="data:image/png;base64,#simpletree.png" width="100%" />
]

---
class: inverse, center, middle

# Classification Trees

---
## Building a Classification Tree

The goal of classification trees is to partition data into homogeneous groups, as defined by **node purity** (i.e., including a larger proportion of one class than the other in each node).

--

In building a classification tree, we use **recursive partitioning** to find the best data splits that maximize node purity.

--

The **Gini index**<sup>[1]</sup> is the most commonly-used metric for quantifying purity, and is calculated as:

`$$Gini = 1 - \sum\limits_{i = 1}^C(p_i)^2$$`

where
- `\(p_i\)` = the probability of being in the `\(i\)`th class
- `\(C\)` = total number of classes

.footnote[
[1] The Gini index ranges from 0 to 1, with smaller values indicating greater purity.
]

---
## Building a Classification Tree

Let's walk through an example of building a classification tree using this toy dataset to predict depression risk:

Stressful Event  | Family History  | Age    | Depression Risk 
:------- | :-------- | :------- |:------- |
No | Yes | 10 | Low
No | No | 12 | Low
Yes | Yes | 16 | High
Yes | Yes | 22 | High
No | Yes | 30 | High
No | No | 38 | Low
Yes | No | 46 | Low

--

The first thing we need to do is choose the **root node** by determining which feature (stressful life event, family history of depression, or age) best predicts future depression risk.

---
## Choosing the Root Node

.pull-left[
Start by finding the **Gini index** of stressful life events.

Stressful Event  | Family History  | Age    | Depression Risk 
:------- | :-------- | :------- |:------- |
No | Yes | 10 | Low
No | No | 12 | Low
Yes | Yes | 16 | High
Yes | Yes | 22 | High
No | Yes | 30 | High
No | No | 38 | Low
Yes | No | 46 | Low
]

--

.pull-right[
<img src="data:image/png;base64,#stresstree.png" width="90%" />
]

---
## Choosing the Root Node

.pull-left[
Both terminal nodes (leaves) are **impure**, as they include people with high and low depression risk.
]

.pull-right[
<img src="data:image/png;base64,#stresstree.png" width="90%" />
]

---
## Choosing the Root Node

.pull-left[
Both terminal nodes (leaves) are **impure**, as they include people with high and low depression risk.

To quantify this impurity, we first calculate the **Gini impurity** of each leaf:

`\(Gini_{leaf} = 1 - P(High)^2 - P(Low)^2\)`
]

.pull-right[
<img src="data:image/png;base64,#stresstree.png" width="90%" />
]

---
## Choosing the Root Node

.pull-left[
Both terminal nodes (leaves) are **impure**, as they include people with high and low depression risk.
To quantify this impurity, we first calculate the **Gini impurity** of each leaf:

`\(Gini_{leaf} = 1 - P(High)^2 - P(Low)^2\)`

`\(Gini_{left} = 0.444\)`

`\(Gini_{right} = 0.375\)`
]

.pull-right[
<img src="data:image/png;base64,#stresstree.png" width="90%" />
]

---
## Choosing the Root Node

.pull-left[
Both terminal nodes (leaves) are **impure**, as they include people with high and low depression risk.

To quantify this impurity, we first calculate the **Gini impurity** of each leaf:

`\(Gini_{leaf} = 1 - P(High)^2 - P(Low)^2\)`

`\(Gini_{left} = 0.444\)`

`\(Gini_{right} = 0.375\)`

We then calculate the total Gini index by taking the weighted average of the Gini leaf indices:

`\(Gini_{stress} = (\frac{3}{7})*0.444 + (\frac{4}{7})*0.375 = 0.405\)`
]

.pull-right[
<img src="data:image/png;base64,#stresstree.png" width="90%" />
]

---
## Choosing the Root Node

We can compare this to the Gini index for family history, which comes to:

`\(Gini_{family}=((\frac{4}{7})*(1 - (\frac{3}{4})^2 - (\frac{1}{4})^2)) + ((\frac{3}{7})*(1 - (\frac{0}{3})^2 - (\frac{3}{3})^2)) = 0.214\)`

.pull-left[
Stressful Event  | Family History  | Age    | Depression Risk 
:------- | :-------- | :------- |:------- |
No | Yes | 10 | Low
No | No | 12 | Low
Yes | Yes | 16 | High
Yes | Yes | 22 | High
No | Yes | 30 | High
No | No | 38 | Low
Yes | No | 46 | Low
]

--

.pull-right[
<img src="data:image/png;base64,#familytree.png" width="90%" />
]

---
## Choosing the Root Node

Calculating the Gini index for **numerical features** is slightly more complicated.

We sort values from lowest to highest, calculate the midpoint of adjacent rows, and use these cutoffs to find the lowest Gini index for all splits.

</br>
</br>
</br>

.pull-left[
Stressful Event  | Family History  | Age    | Depression Risk 
:------- | :-------- | :------- |:------- |
No | Yes | 10 | Low
No | No | 12 | Low
Yes | Yes | 16 | High
Yes | Yes | 22 | High
No | Yes | 30 | High
No | No | 38 | Low
Yes | No | 46 | Low
]

---
## Choosing the Root Node

Calculating the Gini index for **numerical features** is slightly more complicated.
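The Gini arithmetic on the preceding slides is easy to verify in a few lines of base R (a minimal sketch; the `gini_node` and `gini_split` helper names are my own, not from any package):

```r
# Gini impurity of a single node, from its class counts
gini_node <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Weighted Gini index of a binary split (class counts in each child node)
gini_split <- function(left, right) {
  n <- sum(left) + sum(right)
  (sum(left) / n) * gini_node(left) + (sum(right) / n) * gini_node(right)
}

# Stressful event: left leaf (2 High, 1 Low), right leaf (1 High, 3 Low)
round(gini_split(c(2, 1), c(1, 3)), 3)  # 0.405

# Family history: left leaf (3 High, 1 Low), right leaf (0 High, 3 Low)
round(gini_split(c(3, 1), c(0, 3)), 3)  # 0.214
```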
We sort values from lowest to highest, calculate the midpoint of adjacent rows, and use these cutoffs to find the lowest Gini index for all splits.

In this case, the lowest Gini index is for age < 14.5 (Gini = 0.343):

.pull-left[
Stressful Event  | Family History  | Age    | Depression Risk 
:------- | :-------- | :------- |:------- |
No | Yes | 10 | Low
No | No | 12 | Low
Yes | Yes | 16 | High
Yes | Yes | 22 | High
No | Yes | 30 | High
No | No | 38 | Low
Yes | No | 46 | Low
]

.pull-right[
<img src="data:image/png;base64,#agetree.png" width="90%" />
]

---
## Recursive Partitioning

.pull-left[
Comparing the Gini index for stressful life events (0.405) to family history of depression (0.214) to age < 14.5 (0.343), we see that family history has the lowest Gini index (i.e., highest **purity**). Thus, we set family history as the root node.

We then **repeat this partitioning process** to find the next split from the impure node. We can continue this process until we are left with **only pure leaves**.

Note that if we are left with an **impure** leaf, the label is set to the **mode** of all observations within the leaf.
]

--

.pull-right[
<img src="data:image/png;base64,#familytree2.png" width="90%" />
]

---
class: inverse, center, middle

# Regression Trees

---
## Building a Regression Tree

Let's say we have data that look like this. How should we model these data?

<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-16-1.png" width="100%" />

---
## Building a Regression Tree

<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-17-1.png" width="100%" />

---
## Simple Regression Tree

A regression tree can solve this problem by partitioning the data into more homogeneous groups, and using the mean of observations within a group to make predictions.
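To make this concrete, here is a toy numerical example of my own (not the plotted dataset): partition at a cutoff, predict each group's mean, and score the split by its sum of squared errors.

```r
# Toy data with a jump at x = 5
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(10, 11, 9, 10, 24, 26, 25, 25)

# Split into two groups at x < 5 and predict each group's mean
left  <- y[x < 5]
right <- y[x >= 5]
pred  <- ifelse(x < 5, mean(left), mean(right))

# Sum of squared errors for this split (the quantity minimized at each node)
sse <- sum((y - pred)^2)
sse  # 4
```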
.pull-left[
<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-18-1.png" width="100%" />
]

--

.pull-right[
<img src="data:image/png;base64,#regtree.png" width="80%" />
]

---
## Building a Regression Tree

Similar to classification trees, the goal of regression trees is to partition data into homogeneous groups. The difference is that we're now predicting **continuous values** at each terminal node (leaf), rather than classes.

--

In growing a regression tree, we start with the entire data set `\(S\)` to find the optimal feature and splitting value that partitions the data into two groups `\(S_1\)` and `\(S_2\)` to minimize the sum of squared errors:

`$$SSE = \sum\limits_{i \in S_1}(y_i-\bar{y}_1)^2 + \sum\limits_{i \in S_2}(y_i-\bar{y}_2)^2$$`

where
- `\(\bar{y}_1\)` = mean of training set outcomes in group `\(S_1\)`
- `\(\bar{y}_2\)` = mean of training set outcomes in group `\(S_2\)`

--

Within each `\(S_1\)` and `\(S_2\)` group, we then repeat this **recursive partitioning** process until the number of samples within each terminal node falls below some threshold (typically, `\(n=20\)`). The predicted value of the terminal node is then given as the **mean** of all observations within that node.

---
## Comprehension Check

<span style="font-size:24px;">**Let's say we continued this recursive partitioning process until we were left only with pure nodes (classification tree) or minimal SSE (regression tree).
</br> </br> What are some problems that could arise?**</span>

--

</br>

Some answers:
- Overfitting training data
- Poor prediction on test data/future/new data
- Instability of model (if data are even slightly altered, you may find entirely different splits)
- Small number of participants in leaves
- Selection bias: features with higher number of distinct values are favored
- Poorer interpretability

---
class: inverse, center, middle

# Preventing Overfitting

---
## Preventing Overfitting

.pull-left[
If we let decision trees grow to their full capacity, the tree will continue to grow until each terminal node is entirely homogeneous (**100%** training accuracy). This would inevitably lead to **poor prediction** on testing data and any future datasets.

To prevent this from happening, we need to **stop tree growth** at a certain point, before the tree overfits the training data.

There are two main methods for preventing overfitting in classification and regression trees: **stopping criteria** and **pruning**.
]

.pull-right[
<img src="data:image/png;base64,#Day_3B_Slides_files/figure-html/unnamed-chunk-20-1.png" width="100%" />
]

---
## Stopping Criteria

.left-column[
<br />
<img src="data:image/png;base64,#stop.png" width="100%" />
]

.right-column[
**Stopping criteria** prevent trees from continuing to grow if certain conditions are met.

Common conditions include:

1) Not splitting if the leaf is homogeneous (standard)

2) Not splitting if the number of observations in a leaf will fall below some threshold, typically `\(n=20\)` (standard)

3) Not splitting if the **depth** of the tree is above some threshold

How do we decide what maximum depth to stop at? **Hyperparameter tuning** with the `maxdepth` parameter with `method = rpart2` in {caret}!
While stopping criteria prevent us from overfitting a full tree, they can be short-sighted: a split that looks weak early on may be followed by a very good split further down.
]

---
## Stopping Criteria in R

```r
heart <- read.csv("heart.csv")

set.seed(2021)
trainIndex <- createDataPartition(heart$output, p = 0.8, list = FALSE, times = 1)
heart_train <- heart[trainIndex, ]
heart_test <- heart[-trainIndex, ]

heart_recipe <- heart_train %>%
  recipe(output ~ .) %>%
  step_num2factor(output, transform = function(x) x + 1,
                  levels = c("low_risk", "high_risk")) %>%
  step_dummy(all_nominal_predictors())

heart_tree_md <- train(heart_recipe, data = heart_train,
*                      method = 'rpart2',
*                      tuneLength = 6,
                       trControl = trainControl(method = 'cv', number = 10))
```

---
## Stopping Criteria in R

.scroll-output-full[

```r
heart_tree_md
```

```
## CART 
## 
## 243 samples
##  13 predictor
##   2 classes: 'low_risk', 'high_risk' 
## 
## Recipe steps: num2factor, dummy 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 219, 219, 218, 219, 219, 219, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    1        0.7115000  0.4036249
##    3        0.7403333  0.4565185
##    5        0.7491667  0.4766371
##    6        0.7823333  0.5406668
##    9        0.7823333  0.5406668
##   10        0.7823333  0.5406668
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 6.
```
]

---
## Pruning

.left-column[
<br />
<img src="data:image/png;base64,#pruning.png" width="100%" />
]

.right-column[
An alternative to stopping criteria is **pruning**. Rather than preventing a complex tree from growing, we can impose a **penalty** on complex trees with a **complexity parameter** `\(c_p\)`. This modifies the `\(SSE\)` to:

`\(SSE_{c_p} = \sum\limits_{i \in S_1}(y_i-\bar{y}_1)^2 + \sum\limits_{i \in S_2}(y_i-\bar{y}_2)^2 + c_pT\)`

where:
- `\(c_p\)` = complexity parameter
- `\(T\)` = number of terminal nodes

Much like `\(\lambda\)` in regularized regression, the `\(c_p\)` penalty allows us to sacrifice some training accuracy to improve test accuracy.
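Outside of {caret}, the same grow-then-prune idea can be sketched directly with {rpart} (a minimal sketch using rpart's built-in `kyphosis` data; the `cp` value here is arbitrary, not tuned):

```r
library(rpart)

# Grow a deliberately deep tree (cp = 0 imposes no complexity penalty)
full_tree <- rpart(Kyphosis ~ ., data = kyphosis, cp = 0, minsplit = 2)

# Prune back all splits that do not improve fit by at least cp
pruned_tree <- prune(full_tree, cp = 0.05)

# printcp() tabulates cross-validated error for candidate cp values
printcp(pruned_tree)
```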
We use internal cross-validation to tune for optimal `\(c_p\)` parameters with `method = rpart` in {caret}.
]

---
## Pruning in R

```r
heart_tree_cp <- train(heart_recipe, data = heart_train,
*                      method = 'rpart',
                       tuneLength = 6,
                       trControl = trainControl(method = 'cv', number = 10))
```

--

.scroll-output[

```r
heart_tree_cp
```

```
## CART 
## 
## 243 samples
##  13 predictor
##   2 classes: 'low_risk', 'high_risk' 
## 
## Recipe steps: num2factor, dummy 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 219, 219, 219, 219, 218, 219, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.009174312  0.7571667  0.4989085
##   0.015290520  0.7691667  0.5222375
##   0.018348624  0.7525000  0.4893608
##   0.022935780  0.7445000  0.4695195
##   0.068807339  0.6990000  0.3681663
##   0.467889908  0.6378333  0.2430868
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01529052.
```
]

---
## Decision Trees Summary

While stopping criteria and pruning help reduce overfitting, single decision trees are still limited in overall **predictive accuracy**, compared to other supervised learning methods.

A single tree has excellent interpretability and is easy to explain to people. Decision trees also closely mirror human decision-making processes!

--

However, a single tree is also typically **not flexible enough** to accurately classify or predict new data. Single trees can also be highly **unstable**. The structure of trees might change dramatically with a small change in the training data.

--

One solution to these problems is to **combine many trees into an ensemble** to improve predictive accuracy and model stability.

---
class: inverse, center, middle

# Random Forests

---
## Random Forests

.left-column[
</br>
<img src="data:image/png;base64,#randomforest.png" width="100%" />
]

.right-column[
The goal of random forests is to combine the simplicity of a single decision tree with greater model **flexibility**.
Random forests use **bootstrapped aggregation** (i.e., **bagging**) to combine predictions from multiple decision trees together. Random forests also use methods to **decorrelate** trees for more reliable and less variable predictions.

Taken together, this **ensemble method** allows for greater **predictive accuracy** in new datasets than any single classification or regression tree. However, this comes at the cost of **lower interpretability**.
]

---
## Building a Random Forest

Let's return to this toy dataset to walk through the process of building a random forest model.

.pull-left[
Stressful Event  | Family History  | Age    | Depression Risk 
:------- | :-------- | :------- |:------- |
No | Yes | 10 | Low
No | No | 12 | Low
Yes | Yes | 16 | High
Yes | Yes | 22 | High
No | Yes | 30 | High
No | No | 38 | Low
Yes | No | 46 | Low
]

--

.pull-right[
The first step is to create a **bootstrapped dataset** from this original data.

To do so, we can randomly draw (with replacement) samples from this dataset, to create a bootstrapped dataset of the same size.

Importantly, this bootstrapped dataset will include some observations more than once. Other observations will be left out (note: these left-out observations are called the **out-of-bag** sample).
]

---
## Building a Random Forest

**Step 2**: Use the bootstrapped dataset to build a decision tree, using **only a random subset of `\(m\)` features** at each split.

--

.pull-left[
**Bootstrapped Dataset**

Stressful Event  | Family History  | Age    | Depression Risk 
:------- | :-------- | :------- |:------- |
Yes | Yes | 22 | High
No | No | 38 | Low
Yes | Yes | 16 | High
No | No | 12 | Low
No | Yes | 30 | High
No | No | 38 | Low
Yes | Yes | 22 | High
]

--

.pull-right[
<img src="data:image/png;base64,#forest_split1.png" width="80%" />
]

---
## Building a Random Forest

**Step 2**: Use the bootstrapped dataset to build a decision tree, using **only a random subset of `\(m\)` features** at each split.
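The first step (drawing the bootstrapped dataset and setting aside the out-of-bag rows) can be sketched in base R; the `depression` data frame below simply recreates the toy dataset:

```r
# Recreate the toy dataset
depression <- data.frame(
  stress = c("No", "No", "Yes", "Yes", "No", "No", "Yes"),
  family = c("Yes", "No", "Yes", "Yes", "Yes", "No", "No"),
  age    = c(10, 12, 16, 22, 30, 38, 46),
  risk   = c("Low", "Low", "High", "High", "High", "Low", "Low")
)

set.seed(2021)

# Draw row indices with replacement: same size as the original data
boot_idx  <- sample(nrow(depression), replace = TRUE)
boot_data <- depression[boot_idx, ]

# Rows never drawn form the out-of-bag (OOB) sample used for evaluation
oob_data <- depression[-unique(boot_idx), ]
```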
.pull-left[
**Bootstrapped Dataset**

Stressful Event  | Family History  | Age    | Depression Risk 
:------- | :-------- | :------- |:------- |
Yes | Yes | 22 | High
No | No | 38 | Low
Yes | Yes | 16 | High
No | No | 12 | Low
No | Yes | 30 | High
No | No | 38 | Low
Yes | Yes | 22 | High
]

.pull-right[
<img src="data:image/png;base64,#forest_split2.png" width="80%" />
]

---
## Building a Random Forest

**Step 3**: Repeat steps 1 and 2! Generate another bootstrapped dataset and build another decision tree, using only a random subset of `\(m\)` features at each split. Repeat for many (e.g., 1000) trees.

</br>

<img src="data:image/png;base64,#manytrees.png" width="100%" />

---
## Random Forests

**Hyperparameter Tuning**: We can tune the hyperparameter `mtry` to find the **optimal number of features** to select at each split using `method = rf` in {caret}.

--

**Evaluation**: Evaluate accuracy of our random forest based on the proportion of **out-of-bag (OOB)** samples that were correctly classified.

--

**Prediction**: When making predictions for new data, use the outcome with the **most votes** from all trees.

--

**Summary**: By only using a **random subset** of features at each decision tree split, our trees are **decorrelated**. Even if one feature has a very strong relationship with the outcome, it will not have an outsized influence in building our trees.

Averaging the output of many uncorrelated trees is particularly effective at **reducing variance**. Thus, random forests typically produce more **accurate and stable predictions** than single decision trees.

---
class: inverse, center, middle

# Live Coding

---
## Live Coding Activity

**Live Coding**: I will walk through examples of decision trees and random forests in RStudio.

- You can follow along in your own RStudio (easiest with 1 large or 2+ monitors)
- Or you can download `Day_3B_Activity.Rmd` and follow along

.footnote[
All files are on the workshop OSF page: https://osf.io/3qhc8/.
]

--

</br>

**Small Group Activity**: Afterwards, we will split you into small breakout room groups to practice building decision trees and random forests with a new dataset.
If you have any questions, please post them in the chat or workshop Slack channel. We will also float between different breakout rooms to answer questions.